wildcard expansion in vsearch bug fix #8307

Merged
merged 5 commits into from
Apr 23, 2025

Conversation

tom-brekke
Contributor

A wildcard expansion is throwing an "arguments too long" error

gzip -n ${prefix}.${out_ext}* was throwing an "arguments too long" error: my dataset is quite large, and the wildcard expands to over 50k files at this step.

I've fixed it by using find to collect the needed files and piping them to xargs gzip.

It was still acting oddly on the HPC: some files ended up with two, three, or four ".gz"s at the end, so find must have been passing them to xargs gzip multiple times. Seems odd. Anyway, that was easily fixed by adding a single-digit character class ([0-9]) at the end of the pattern.
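As a rough illustration of the approach described above, here is a minimal sketch that can be run in a scratch directory. The helper name gzip_cluster_files, the demo file names, and the -maxdepth 1, -print0 and -0 options are assumptions made for this example and are not part of the module change; -type f reflects the suggestion raised in review.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the module): gzip every regular file in the
# current directory whose name starts with "<stem>" and ends in a digit.
gzip_cluster_files() {
    local stem="$1"
    # find streams the names to xargs, which splits them into batches that
    # fit the argument-length limit, so no single huge command line is built.
    # -maxdepth 1, -print0 and -0 are assumptions of this sketch
    # (supported by GNU and BSD find/xargs).
    find . -maxdepth 1 -type f -name "${stem}*[0-9]" -print0 | xargs -0 gzip -n
}

# Demo in a throwaway directory with made-up file names.
workdir="$(mktemp -d)"
cd "$workdir"
for i in 0 1 2 10000; do
    printf '>seq%s\nACGT\n' "$i" > "demo.clusters.fasta${i}"
done

gzip_cluster_files "demo.clusters.fasta"
ls
```

The pattern is quoted so that find, not the shell, performs the matching; that is what avoids expanding tens of thousands of names onto one command line.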

PR checklist

Closes #8305

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    no test data added, but I've run it on my dataset and it works. Happy to discuss if needed.

gzip -n ${prefix}.${out_ext}* was throwing an "arguments too long" error. 
I've fixed it by using find to get all the needed files and pipe them to xargs gzip. 
It was still acting oddly on my computer (some files ended up with two, three, or four ".gz"s at the end - so it must have been passing them to gzip multiple times). That was easily fixed by adding the regex for a single digit ("[0-9]") at the end of the pattern.
wildcard expansion throwing an "arguments too long" error
@@ -60,7 +60,7 @@ process VSEARCH_CLUSTER {

if [[ $args3 == "--clusters" ]]
then
-gzip -n ${prefix}.${out_ext}*
+find . -name \"${prefix}.${out_ext}*[0-9]\" | xargs gzip -n
Contributor

I would be consistent throughout: if you've had this issue with --clusters, others might have the same issue with --samout. Could you update that one as well?

Also, is it a given that vsearch always appends a single digit to the end of the file name?

It might also be good to tell find that we're looking for files, with -type f.

Contributor Author

I added -type f; good call on that.

The --samout bit doesn't use a wildcard expansion like the --clusters does, so I think it should be ok as is. vsearch outputs many single-entry fastas, one for each centroid, whereas I believe that samtools makes a single multi-entry fasta so no wildcard expansion is needed. Happy to discuss further if I've missed something.

vsearch does append digits to the end of the file name when the --clusters flag is set, starting at 0 and counting up, with one file per cluster centroid:

ASV_post_clustering.clusters.fasta0
ASV_post_clustering.clusters.fasta10000
ASV_post_clustering.clusters.fasta10001
ASV_post_clustering.clusters.fasta10002
ASV_post_clustering.clusters.fasta10003
...

So anchoring the pattern with a trailing digit does match all of those. gzip then appends .gz, so the files end up looking like

ASV_post_clustering.clusters.fasta0.gz
ASV_post_clustering.clusters.fasta10000.gz
ASV_post_clustering.clusters.fasta10001.gz
ASV_post_clustering.clusters.fasta10002.gz
ASV_post_clustering.clusters.fasta10003.gz
...

which will not be matched again, as they no longer end with a digit.
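The matching behaviour described above can be checked without touching any files. This is a small sketch, assuming that a shell case pattern follows the same glob rules as find -name; the function name ends_in_digit is made up for the example.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds when the name matches the glob "*[0-9]",
# i.e. when its last character is a digit.
ends_in_digit() {
    case "$1" in
        *[0-9]) return 0 ;;
        *)      return 1 ;;
    esac
}

# File names taken from the listing above, before and after gzip.
for name in \
    ASV_post_clustering.clusters.fasta0 \
    ASV_post_clustering.clusters.fasta10003 \
    ASV_post_clustering.clusters.fasta0.gz \
    ASV_post_clustering.clusters.fasta10003.gz
do
    if ends_in_digit "$name"; then
        echo "match:    $name"
    else
        echo "no match: $name"
    fi
done
```

The two uncompressed names match and the two .gz names do not, so a file that has already been compressed cannot be selected a second time by the same pattern.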

Contributor

Ok great, then it's good to go I believe!

Contributor Author

thank you!

@SPPearce SPPearce enabled auto-merge April 23, 2025 20:10
@SPPearce SPPearce added this pull request to the merge queue Apr 23, 2025
Merged via the queue into nf-core:master with commit f10b246 Apr 23, 2025
28 checks passed
Successfully merging this pull request may close these issues.

wildcard expansion triggering "argument too long" error